
fix: Add support for unsigned Arrow datatypes in schema conversion #1617


Open
wants to merge 8 commits into main

Conversation

@gkpanda4 commented Aug 18, 2025

Which issue does this PR close?

What changes are included in this PR?

Bug Fixes

  • Fixed crash when ArrowSchemaConverter encounters unsigned datatypes
  • Resolved "Unsupported Arrow data type" errors for UInt8/16/32/64

Features

  • Added casting support for unsigned Arrow types
  • UInt8/16 → Int32 (safe casting to larger signed type)
  • UInt32 → Int64 (safe casting to larger signed type)
  • UInt64 → Error (no safe casting option, explicit error with guidance)

Code Changes

  • Enhanced the ArrowSchemaConverter primitive() method with unsigned type handling (see the sketch after this list)
  • Added comprehensive test: test_unsigned_type_casting() for all unsigned variants
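
For illustration, the conversion described above reduces to a few match arms in the converter's primitive() visitor. The sketch below is a standalone rendering of that logic rather than the exact diff; the error kind and message wording are assumptions:

use arrow_schema::DataType;
use iceberg::spec::{PrimitiveType, Type};
use iceberg::{Error, ErrorKind, Result};

// Standalone sketch of the unsigned-type handling; in the PR this lives in
// ArrowSchemaConverter's primitive() method rather than a free function.
fn convert_unsigned(dt: &DataType) -> Result<Type> {
    match dt {
        // UInt8/UInt16 fit losslessly in Iceberg's 32-bit int.
        DataType::UInt8 | DataType::UInt16 => Ok(Type::Primitive(PrimitiveType::Int)),
        // UInt32 fits losslessly in Iceberg's 64-bit long.
        DataType::UInt32 => Ok(Type::Primitive(PrimitiveType::Long)),
        // UInt64 has no lossless signed counterpart, so it is rejected explicitly.
        DataType::UInt64 => Err(Error::new(
            ErrorKind::DataInvalid,
            "Unsupported Arrow data type: UInt64. Cast to Int64 (with range validation) or to a Decimal instead.",
        )),
        other => Err(Error::new(
            ErrorKind::DataInvalid,
            format!("Not an unsigned integer type: {other:?}"),
        )),
    }
}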

Files Modified

  • crates/iceberg/src/arrow/schema.rs

Impact

✅ No breaking changes - existing functionality preserved
✅ Safe type casting prevents overflow issues
✅ Clear error messages for unsupported UInt64 with alternatives
✅ Follows proven PyIceberg implementation approach

Are these changes tested?

  • All existing schema tests pass
  • New comprehensive test covers UInt8, UInt16, UInt32, UInt64 conversion behavior
  • Test verifies proper casting: UInt8/16→Int32, UInt32→Int64, UInt64→Error

@emkornfield commented:

Sorry, I'm new to reviewing (and mostly new to the code base), so take these comments with a grain of salt, but this approach seems brittle:

  1. What happens if someone updates the doc field, removing the type information, and Arrow RS tries to read the data back?
  2. What happens if a non-Arrow-RS reader tries to read data written from Arrow with these fields (in particular int32 and int64)?

It seems a more robust solution would be to:

  1. Convert uint32->int64
  2. Either still block uint64, convert uint64 to a Decimal with an appropriate precision to represent the full range (see the note below), or use int64 and validate that no values written are outside the appropriate range.
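
A note on the precision needed for that Decimal option: u64::MAX is 18446744073709551615, i.e. 20 decimal digits, so a decimal(20, 0) would cover the full uint64 range. A trivial check (not from the PR):

fn main() {
    // u64::MAX has 20 decimal digits, so decimal(20, 0) can hold any uint64 value.
    assert_eq!(u64::MAX.to_string().len(), 20);
}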

@CTTY (Contributor) commented Aug 20, 2025

I have the same concern as @emkornfield: using doc to determine the field type seems unsafe to me. I think casting the type should be fine. There would be type loss when converting the Iceberg schema back to an Arrow schema, but that should be ok.

Also, the Python implementation can serve as a good reference. Note that PyIceberg uses bit width, while arrow-rs only provides primitive_width(), which returns the width in bytes.
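
To make that difference concrete (assuming arrow_schema::DataType; the exact branching in the PR may differ):

use arrow_schema::DataType;

fn main() {
    // primitive_width() reports the width in bytes; PyIceberg branches on bits,
    // so the value has to be multiplied by 8 before comparing.
    let bit_width = DataType::UInt32.primitive_width().map(|bytes| bytes * 8);
    assert_eq!(bit_width, Some(32)); // 4 bytes * 8 = 32 bits
}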

@gkpanda4 (Author) commented:

@emkornfield Right, the current approach has the potential for silent data corruption because of its dependency on Arrow's doc field. @CTTY Thanks for the references; I will use them for the casting.

My updated approach uses safe bit-width casting for unsigned integer types, following the proven iceberg-python implementation:

uint8/uint16 → int32: Safe upcast with no overflow risk
uint32 → int64: Safe upcast preserving full uint32 range
uint64 → explicit block: Rather than risk data loss through an unsafe conversion, provide clear error guidance directing users to choose between int64 (with range validation) and decimal (with full precision) based on their specific requirements

Let me know if there are any concerns; otherwise I will have the changes out.

// Cast unsigned types based on bit width (following Python implementation)
DataType::UInt8 | DataType::UInt16 | DataType::UInt32 => {
// Cast to next larger signed type to prevent overflow
let bit_width = p.primitive_width().unwrap_or(0) * 8; // Convert bytes to bits


This seems superfluous; can't you just match on the data types and map them directly?

DataType::UInt8 | DataType::UInt16 => Ok(Type::Primitive(PrimitiveType::Int)),
DataType::UInt32 => Ok(Type::Primitive(PrimitiveType::Long)),

@gkpanda4 (Author):

Yes, will make this logic simple

@@ -378,7 +378,24 @@ impl ArrowSchemaVisitor for ArrowSchemaConverter {
DataType::Int8 | DataType::Int16 | DataType::Int32 => {
Ok(Type::Primitive(PrimitiveType::Int))
}
// Cast unsigned types based on bit width (following Python implementation)


Suggested change:
- // Cast unsigned types based on bit width (following Python implementation)
+ // Cast unsigned types based on bit width to allow for no data loss

I'm not sure Python compatibility is a direct goal here?

@gkpanda4 (Author):

Will incorporate this


// Test UInt8/UInt16 → Int32 casting
{
let arrow_field = Field::new("test", DataType::UInt8, false).with_metadata(


nit: it doesn't look like this is testing UInt16?

@gkpanda4 (Author):

I have added this scenario

@@ -1717,6 +1734,60 @@ mod tests {
}
}

#[test]
fn test_unsigned_type_casting() {
// Test UInt32 → Int64 casting


Is it possible to parameterize the non-error cases, at least with expected input/output, to avoid boilerplate?

@gkpanda4 (Author):

Done
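
For reference, a table-driven shape for the non-error cases could look roughly like this. This is a sketch rather than the committed test; the "PARQUET:field_id" metadata key and the field_by_id lookup are assumptions about the surrounding test module:

use std::collections::HashMap;

use arrow_schema::{DataType, Field, Schema as ArrowSchema};
use iceberg::arrow::arrow_schema_to_schema;
use iceberg::spec::{PrimitiveType, Type};

#[test]
fn test_unsigned_type_casting_table_driven() {
    // (input Arrow type, expected Iceberg primitive type)
    let cases = [
        (DataType::UInt8, PrimitiveType::Int),
        (DataType::UInt16, PrimitiveType::Int),
        (DataType::UInt32, PrimitiveType::Long),
    ];
    for (arrow_type, expected) in cases {
        // Field IDs are carried in Arrow field metadata during conversion.
        let field = Field::new("test", arrow_type, false).with_metadata(HashMap::from([(
            "PARQUET:field_id".to_string(),
            "1".to_string(),
        )]));
        let schema = arrow_schema_to_schema(&ArrowSchema::new(vec![field])).unwrap();
        let converted = schema.field_by_id(1).unwrap().field_type.clone();
        assert_eq!(*converted, Type::Primitive(expected));
    }
}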

@@ -1717,6 +1734,60 @@ mod tests {
}
}

#[test]


This probably isn't the right module, but it would be nice to have a test that actually exercises writing these types and then reading them back again?

@gkpanda4 (Author):

I implemented an integration test for the unsigned-type roundtrip, but discovered that ParquetWriter would also require modification to handle unsigned data. The issue stems from a type mismatch between schema and data.

The problem is that schema conversion (arrow_schema_to_schema) transforms the metadata but leaves the actual data unchanged, so when writing, Arrow validation fails because of this mismatch.
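
Not part of this PR, but to make the mismatch concrete: on the write path the batch columns would have to be cast to the signed types the converted schema expects before they reach the Arrow writer. A minimal sketch, assuming arrow_array and arrow_cast:

use std::sync::Arc;

use arrow_array::{ArrayRef, RecordBatch, UInt32Array};
use arrow_cast::cast;
use arrow_schema::{ArrowError, DataType, Field, Schema as ArrowSchema};

fn cast_example() -> Result<RecordBatch, ArrowError> {
    // The data arrives as UInt32 ...
    let unsigned: ArrayRef = Arc::new(UInt32Array::from(vec![1, 2, u32::MAX]));
    // ... but the schema derived from the Iceberg table expects Int64,
    // so writing the unsigned column as-is fails Arrow's validation.
    let target = Arc::new(ArrowSchema::new(vec![Field::new("v", DataType::Int64, false)]));
    // Casting the column up front keeps the batch consistent with that schema.
    let signed = cast(&unsigned, &DataType::Int64)?;
    RecordBatch::try_new(target, vec![signed])
}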

@CTTY (Contributor) commented Aug 22, 2025:

I think writing record batches that contain unsigned types is out of the scope of the original issue, and it can be tricky:

  • ParquetWriter uses AsyncArrowWriter under the hood
  • AsyncArrowWriter uses an Arrow schema that was converted from the Iceberg table schema
  • When converting an Iceberg schema to an Arrow schema, the Arrow schema won't have any unsigned types (and I don't think it makes sense to add them unless there is a valid use case)
  • Because of the schema mismatch between the record batches and the Arrow schema, the Arrow writer will fail

Reply:

Thanks, from the original issue it seems the scope is ambiguous. It seems like this change makes it possible to create a schema from Arrow with unsigned types, which might be helpful by itself, but I imagine the next thing the user would want to do is actually write the data?

It seems fine to check this in separately as long as there is a clean failure for the unsigned types (i.e. we don't silently lose data).

@emkornfield left a comment:

A few more comments. The additional test coverage would be my primary concern; the rest are mostly style nits.

@gkpanda4 requested review from emkornfield and CTTY on August 22, 2025 at 20:37
@emkornfield left a comment:

Seems like we should figure out if this actually closes out the original issue or if we should keep it open for write support, but with my limited knowledge these changes seem reasonable.


Successfully merging this pull request may close these issues.

bug: ArrowSchemaConverter can't handle unsigned datatypes from arrow
3 participants